Wednesday, April 09, 2008

Why I like Stata... reason #9

I'm evaluating West Virginia's benchmark performance on the Interstate Compact on the Placement of Children (or ICPC). This compact establishes agreements between States so that one may place a child into foster or kinship care in another. It's necessary, because there are many steps that are required before a child can be placed (background checks, home studies, etc)... Intrastate placement is comparatively easy because the State law governs all agencies within the State's borders. If I want to place a child in an adjacent county & State law dictates that county social services will take care of the background checks, homestudies, etc, then we (theoretically) have no problems. But if I wanted to place the child in Pennsylvania, I'd have no means of compelling their social services to do this work and I wouldn't be able to do it myself, because I'm not licensed in that State. The ICPC facilitates state-to-state agreements.

I received a bunch of data from our State's ICPC office a few weeks ago. I'm using these data to calculate performance indicators (mean or modal # of days from benchmark 1 to 2, estimates of variance, etc).

In this line of work delays are common. For that reason, the federal government has been trying to figure out ways to structure incentives to minimize delays. However, they drafted the legislation without really understanding the dynamics that contributes to them (shocking, I know). In this State, they've been proactive in gathering data to document reasons for delay. The spreadsheet that they sent me includes a column labeled delay and containing the following codes.

delay
1
1 & 2
1 & 3
1 & 5
1 &2
1&3
1, 2, & 3
1, 2 & 3
1, 2 & 3
1, 2 & 4
1, 2, & 3
1, 2, 3 &5
1, 3 & 5
1, 3 & 5
1,2 & 3
1,2,& 3
1,4 & 5
2
2 & 3
3
3 & 5
3 &5
4
5

There are six codes for delays (code 6 was never used in these data). To make these data analytically useful, I have to create binary indicator variables (0/1) for each delay condition.

Now if I were using SPSS bad software, I'd have to write approximately 25 IF/Then Statements with the proper substring parsing to convert these data into something that's useful. This is the sort of thing that we did a lot of at the old data farm (and it's one of the reasons I looked for employment away from the data farm, ugh).

But Stata has implemented regular expression substring functions. While regular expressions are truly cool, I don't need to use a complicated pattern match here. I simply need to know, for each of my six indicator variables, if the matching digit is somewhere in the string. Take #3. If you look at the list above, you'll notice that sometimes 3 is in the first position (3 & 5), sometimes it's in the middle (1, 3 & 5) and sometimes it's last. Moreover, there's lots of uneven text in this string field (spaces, commas, the & sign, etc). I just want to know if code 3 applies for a given case. Stata lets me do that efficiently and elegantly.

foreach x of numlist 1/6 {
gen delay`x' = (regexm(delay,"`x'"))
}

This is a simple loop that cycles through a list of values. For more about the power of forloops in Stata, consult Nick Cox's excellent article on the topic. This particular loop initiated with 1, does what I ask it to do, increments to 2, does what I ask it to do, and continues to iterate until it reaches six. Below is my translation of what this concise little loop did.
Translation:
Dear Stata, for each digit in the series 1,2,3,4,5 & 6, please generate six new variables, one for each digit in the series. Name the variables delay1, delay2... delay6. Please use the regular expression match function to examine the original string variable (called delay) and indicate if the reference digit is present. If it is, please follow the rules of regexm (which is to return a 1 if true and 0 if false). Thank you and have a lovely day.

The results:

. tab delay1

delay1 | Freq. Percent Cum.
------------+-----------------------------------
0 | 281 87.00 87.00
1 | 42 13.00 100.00
------------+-----------------------------------
Total | 323 100.00

. tab delay2

delay2 | Freq. Percent Cum.
------------+-----------------------------------
0 | 262 81.11 81.11
1 | 61 18.89 100.00
------------+-----------------------------------
Total | 323 100.00

. tab delay3

delay3 | Freq. Percent Cum.
------------+-----------------------------------
0 | 250 77.40 77.40
1 | 73 22.60 100.00
------------+-----------------------------------
Total | 323 100.00

you get the point.. it worked

So now, I have functional indicator variables that I can plug into my models and be all quantitative and stuff.

Why did I post this indictment of my geekiness? Because, I know many people are doing applied social research based on uncleaned administrative data. This is information often gathered for purposes other than analysis. If you get to know the programming capabilities of your statistics package of choice (yes, even SPSS) you can save yourself a heck of a lot of work. Avoid the impulse to use point and click menus. There is no GUI guidance from Stata on how to put together this loop; had I relied on the GUI I'd never get this thing done.

That is all.